{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Checking if a pair of stocks is cointegrated"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pandas as pd\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from statsmodels.tsa.stattools import adfuller\n",
    "import matplotlib.pyplot as plt\n",
    "import quiz_tests"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Set plotting options\n",
    "%matplotlib inline\n",
    "plt.rc('figure', figsize=(16, 9))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# just set the seed for the random number generator\n",
    "np.random.seed(2018)\n",
    "# use returns to create a price series\n",
    "drift = 100\n",
    "r1 = np.random.normal(0, 1, 1000) \n",
    "s1 = pd.Series(np.cumsum(r1), name='s1') + drift\n",
    "s1.plot(figsize=(14,6))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "offset = 10\n",
    "noise = np.random.normal(0, 1, 1000)\n",
    "s2 = s1 + offset + noise\n",
    "s2.name = 's2'\n",
    "pd.concat([s1, s2], axis=1).plot(figsize=(15,6))\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "price_ratio = s2/s1\n",
    "price_ratio.plot(figsize=(15,7)) \n",
    "plt.axhline(price_ratio.mean(), color='black') \n",
    "plt.xlabel('Days')\n",
    "plt.legend(['s2/s1 price ratio', 'average price ratio'])\n",
    "plt.show()\n",
    "print(f\"average price ratio {price_ratio.mean():.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate hedge ratio with regression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Linear Regression\n",
    "\n",
    "Note that the LinearRegression().fit() expects 2D numpy arrays.  Since s1 and s2 are pandas series, we can use Series.values to get the values as a numpy array. Since these are 1D arrays, we can use numpy.reshape(-1,1) to make these 1000 row by 1 column 2 dimensional arrays"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(s1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "type(s1.values)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "s1.values.reshape(-1,1).shape"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "lr = LinearRegression()\n",
    "lr.fit(s1.values.reshape(-1,1),s2.values.reshape(-1,1))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "hedge_ratio = lr.coef_[0][0]\n",
    "hedge_ratio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "intercept = lr.intercept_[0]\n",
    "intercept"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"hedge ratio from regression is {hedge_ratio:.4f}, intercept is {intercept:.4f}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    " ## Question\n",
    " Do you think we'll need the intercept when calculating the spread?  Why or why not?"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Calculate the spread"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spread = s2 - s1 * hedge_ratio"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(f\"Average spread is {spread.mean()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spread.plot(figsize=(15,7)) \n",
    "plt.axhline(spread.mean(), color='black') \n",
    "plt.xlabel('Days')\n",
    "plt.legend(['Spread: s2 - hedge_ratio * s1', 'average spread'])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Let's see what we get if we include the intercept of the regression"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spread_with_intercept = s2 - (s1 * hedge_ratio + intercept)\n",
    "print(f\"Average spread with intercept included is {spread_with_intercept.mean()}\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spread_with_intercept.plot(figsize=(15,7)) \n",
    "plt.axhline(spread_with_intercept.mean(), color='black') \n",
    "plt.xlabel('Days')\n",
    "plt.legend(['Spread: s2 - (hedge_ratio * s1 + intercept)', 'average spread'])\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Quiz\n",
    "### Check if spread is stationary using Augmented Dickey Fuller Test\n",
    "\n",
    "The [adfuller](http://www.statsmodels.org/dev/generated/statsmodels.tsa.stattools.adfuller.html) function is part of the statsmodel library.\n",
    "\n",
    "```\n",
    "adfuller(x, maxlag=None, regression='c', autolag='AIC', store=False, regresults=False)[source]\n",
    "\n",
    "adf (float) – Test statistic\n",
    "pvalue (float) – p-value\n",
    "...\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "def is_spread_stationary(spread, p_level=0.05):\n",
    "    \"\"\"\n",
    "    spread: obtained from linear combination of two series with a hedge ratio\n",
    "    \n",
    "    p_level: level of significance required to reject null hypothesis of non-stationarity\n",
    "    \n",
    "    returns:\n",
    "        True if spread can be considered stationary\n",
    "        False otherwise\n",
    "    \"\"\"\n",
    "    #TODO: use the adfuller function to check the spread\n",
    "    #adf_result = \n",
    "    \n",
    "    #get the p-value\n",
    "    #pvalue = \n",
    "    \n",
    "    print(f\"pvalue {pvalue:.4f}\")\n",
    "    if pvalue <= p_level:\n",
    "        print(f\"pvalue is <= {p_level}, assume spread is stationary\")\n",
    "        return True\n",
    "    else:\n",
    "        print(f\"pvalue is > {p_level}, assume spread is not stationary\")\n",
    "        return False\n",
    "    \n",
    "quiz_tests.test_is_spread_stationary(is_spread_stationary)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Try out your function\n",
    "print(f\"Are the two series candidates for pairs trading? {is_spread_stationary(spread)}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you're stuck, you can also check out the solution [here](pairs_candidates_solution.ipynb)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
